Deep Learning Recommendation Models (DLRM) are widespread, account for a considerable data center footprint, and grow by more than 1.5x per year. With model sizes soon reaching the terabyte range, leveraging Storage Class Memory (SCM) for inference can lower power consumption and cost. This paper evaluates the major challenges in extending the memory hierarchy to SCM for DLRM and presents different techniques to improve performance through a Software Defined Memory. We show how underlying technologies such as NAND Flash and 3DXP differ, and relate them to real-world scenarios, enabling power savings of 5% to 29%.
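As a rough illustration of the tiering idea described above (not the paper's implementation), the sketch below keeps a small "hot" set of embedding rows in a fast tier standing in for DRAM and serves the long tail from a larger, slower tier standing in for SCM. The class name, sizes, and caching policy are all hypothetical.

```python
# Illustrative two-tier embedding store: a small "hot" cache emulates DRAM,
# while the full table emulates a larger, slower SCM-backed device.
import numpy as np

class TieredEmbeddingTable:
    def __init__(self, num_rows, dim, hot_capacity):
        # "SCM" tier: the full embedding table (a NumPy array as a stand-in).
        self.scm = np.random.rand(num_rows, dim).astype(np.float32)
        # "DRAM" tier: a small cache of previously accessed rows.
        self.hot_capacity = hot_capacity
        self.hot = {}  # row index -> vector

    def lookup(self, indices):
        out = np.empty((len(indices), self.scm.shape[1]), dtype=np.float32)
        for i, idx in enumerate(indices):
            vec = self.hot.get(idx)
            if vec is None:
                vec = self.scm[idx]            # slow-tier access
                if len(self.hot) < self.hot_capacity:
                    self.hot[idx] = vec        # promote to the fast tier
            out[i] = vec
        return out

table = TieredEmbeddingTable(num_rows=100_000, dim=64, hot_capacity=10_000)
vectors = table.lookup([3, 42, 3, 7])
print(vectors.shape)  # (4, 64)
```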
The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say 32-512 data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions; as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We discuss several strategies to attempt to help large-batch methods eliminate this generalization gap.
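To make the small-batch regime concrete, here is a minimal NumPy sketch (not the authors' code) in which each step estimates the gradient of a least-squares loss from a randomly sampled batch of 128 points; the dataset, batch size, and learning rate are illustrative assumptions.

```python
# Minimal sketch of mini-batch SGD: the gradient of a mean-squared-error loss
# is approximated from a random mini-batch instead of the full training set.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))                              # training inputs
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=10_000)    # noisy targets

def minibatch_grad(w, batch_size):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    # Gradient of the mean squared error on the sampled batch only.
    return 2.0 / batch_size * Xb.T @ (Xb @ w - yb)

w = np.zeros(20)
lr = 0.01
for step in range(1_000):
    w -= lr * minibatch_grad(w, batch_size=128)   # small-batch SGD step

# Full-data gradient norm as a rough check of convergence.
full_grad = 2.0 / len(X) * X.T @ (X @ w - y)
print("norm of full gradient after training:", np.linalg.norm(full_grad))
```

Increasing `batch_size` toward the full dataset reduces the noise in each gradient estimate, which is the regime the abstract associates with convergence to sharper minimizers.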